Abstract 

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas 
of biological research including de novo genome assembly, structural-variant identification, haplotype 
phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of 
SMRT sequences can spur development of analytic tools that can accommodate unique characteristics 
of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to 
high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from 
five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and 
Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI 
Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4-C2 
and P5-C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction 
by the research community to generate whole-genome assemblies, test new algorithms, investigate 
genome structure and evolution, and identify base modifications in some of the most widely-studied 
model systems in biological research. 
Background and Summary 

Single-molecule, real-time (SMRT®) DNA sequencing occurs by optically detecting a fluorescent signal 
when a nucleotide is being incorporated by a DNA polymerase [1-4]. This relatively new technology 
enables detection of DNA sequences that have unique characteristics, such as long read lengths, lack of 
GC bias, and random error profiles, and can yield highly accurate consensus sequences [5]. Kinetic 
information such as pulse width and interpulse duration is also recorded and can be used to detect 
base modifications [6-8]. 

Since its introduction, investigators have published on a range of applications using SMRT sequencing. 
For example, the developers of GATK (Genome Analysis Toolkit) demonstrated that single nucleotide 
polymorphisms (SNPs) could be detected using SMRT sequences [9, 10] due to their lack of context- 
specific bias and systematic error [5, 10]. Likewise, the developers of PBcR (PacBio error correction) [11, 
12] showed that complete bacterial genome assemblies using SMRT sequence data had greater than 
Q60 base quality [12]. PBcR was later incorporated as the "pre-assembly" step in the HGAP (hierarchical 
genome assembly process) system [13], followed by consensus polishing using the Quiver algorithm [13] 
to produce a complete assembly pipeline for SMRT sequence data. In addition, other third-party tools 
now support long reads for various applications such as mapping [14, 15], scaffolding [16], structural- 
variation discovery [17], and genome assembly [11, 18]. Other applications such as 16S rRNA sequencing 
[19], characterization of entire transcriptomes [20, 21], genome-editing studies [22], base-modification 
studies [7, 8, 23-25], and validation of CRISPR targets [26] have also been published. 

To encourage interest in further applications and tool development for SMRT sequence data, we report 
here the release of whole-genome shotgun-sequence datasets from five model organisms (E. coli, S. 
cerevisiae, N. crassa, A. thaliana, and D. melanogaster). These organisms have among the most 
complete and well-annotated reference genome sequences, due to continual refinement by dedicated 
teams of scientists. Despite continued improvement of these genome sequences with new 
technologies, few are completely finished with fully contiguous assemblies of all chromosomes. The 
gaps remaining arise from complex structures such as transposable elements, repeats, segmental 
duplications, or other dynamic regions of the genome that cannot be easily assembled. Structural 
differences in these regions can account for variability in millions of nucleotides within every genome, 
and mounting evidence suggests that such mutations are important for human diversity and disease 
susceptibility in many complex traits including autism and schizophrenia [27-29]. SMRT sequencing data 
can therefore play an important role in the completion of these and other reference genomes, providing 
a platform for new insights into genome biology. 
Methods 

We generated eight whole-genome shotgun-sequence datasets from five model organisms using the 
P4C2 or P5C3 polymerase and chemistry combinations, totaling nearly 1000 gigabytes (GB) of raw data 
(See Data Records section). Genomic DNA was either purchased from commercial sources or generously 
provided by collaborators. 

DNA from the reference K12 strain of E. coli was purchased from Lofstrand Labs Limited (K12 MG1655 E. 
coli, cat# L3-4001SP2). DNA from the reference OR74A strain of N. crassa was purchased from the 
Fungal Genetics Stock Center (FGSC). A standard Ler-0 strain of A. thaliana plants was grown from seeds 
purchased from Lehle Seeds (WT-04-19-02) and DNA was extracted at Pacific Biosciences. The protocol 
is available on SampleNet [30] and summarized in the organism-specific methods section of this paper. 
DNA from the 9464 strain of S. cerevisiae was provided by J. Li at University of California San Francisco. 
The 9464 strain is a daughter of the reference WG303 strain. DNA from the T1 strain of N. crassa was 
obtained from D. Catcheside at Flinders University who has an interest in polymorphic genes regulating 
recombination. The T1 strain is an A mating type strain which, like OR74A, was derived from a cross 
between the Em a 5297 and Em A 5256 strains. DNA from the ISO1 strain [31] of D. melanogaster was 
obtained from S. Celniker at Lawrence Berkeley National Laboratory. This is the reference strain of D. 
melanogaster that was originally chosen to be the first large genome to be sequenced and assembled 
using a whole-genome shotgun approach [32]. It continues to serve as the reference strain in 
subsequent releases and numerous annotations of the D. melanogaster genome. 

DNA extraction methods were species-specific and optimized for each organism (see organism-specific 
methods below). In general, the steps are: (1) remove debris and particulate material, (2) lyse cells, (3) 
remove membrane lipids, proteins and RNA, and (4) purify the DNA. 

SMRTbell™ libraries for sequencing [9] were prepared using either 10 kb [33, 34] or 20 kb [35] 
preparation protocols to optimize for the highest-quality and longest reads. The main steps for library 
preparation are: (1) shearing, (2) DNA damage repair, (3) blunt-end ligation with hairpin adapters 
supplied in the DNA Template Prep Kit 2.0 (Pacific Biosciences), (4) size selection, and (5) binding to 
polymerase using the DNA Sequencing Kit 3.0 (Pacific Biosciences). 
Table 1: Summary of DNA Samples. The NCBI sample ID associated with each dataset is provided. DNA 
was extracted in a species-specific manner, yielding genomic DNA of various sizes. All DNA was size 
selected using the Blue Pippin system (Sage Science), and select samples were sheared with g-TUBEs 
(Covaris). 



| Dataset Name | Sample ID | DNA extraction | gDNA size (kb) | Shearing | Size selection |
|---|---|---|---|---|---|
| E. coli MG1655 P4C2 | SAMN02951645 | ammonium acetate or SDS, proteinase K, phenol-chloroform | 10 | none | Blue Pippin (7 kb) |
| E. coli MG1655 P5C3 | SAMN02743420 | ammonium acetate or SDS, proteinase K, phenol-chloroform | 10 | none | Blue Pippin (7 kb) |
| S. cerevisiae 9464 P4C2 | SAMN02731377 | contact J. Li at UCSF | >40 | g-TUBE | Blue Pippin (17 kb) |
| N. crassa OR74A P4C2 | SAMN02724975 | BashingBeads, Zymo Research kit | 6 | none | Blue Pippin (4 kb) |
| N. crassa T1 P4C2 | SAMN02724976 | SDS, proteinase K, phenol-chloroform, RNAase, isopropanol | 15 | none | Blue Pippin (7 kb) |
| A. thaliana Ler-0 P5C3 | SAMN02724977 | CTAB, chloroform:isoamyl, isopropanol precip. | >40 | g-TUBE | Blue Pippin (15 kb) |
| A. thaliana Ler-0 P4C2 | SAMN02731378 | CTAB, chloroform:isoamyl, isopropanol precip. | >40 | g-TUBE | Blue Pippin (7 kb) |
| D. melanogaster ISO1 P5C3 | SAMN02614627 | SDS, phenol-chloroform, CsCl banding, ethanol precip. | >40 | g-TUBE | Blue Pippin (17 kb) |


E. coli collection, DNA Extraction, and SMRTbell Library Preparation 

Both P4C2 and P5C3 samples were prepared in the same way. E. coli K12 genomic DNA was ordered and purified by Lofstrand 
Labs Limited (K12MG1655 E. coli, cat# L3-4001SP2). Field Inversion Gel Electrophoresis (FIGE) was run to ensure presence of 
high-molecular-weight gDNA. Ten micrograms of gDNA was sheared using g-TUBE devices (Covaris, Inc) spun at 5500 rpm for 1 
minute. Three microliters of elution buffer (EB) was added to rinse the upper chamber, spun at 6000 rpm, and spun again at 
5500 rpm after inverting the g-TUBE device. SMRTbell libraries were created using the Procedure & Checklist - 20 kb Template 
Preparation using BluePippin™ Size Selection protocol [35]. Briefly, the library was run on a BluePippin system (Sage Science, 
Inc., Beverly, MA, USA) to select for SMRTbell templates greater than 10 kb. The resulting average insert size was 17 kb based 
on the 2100 Bioanalyzer instrument (Agilent Technologies Genomics, Santa Clara, CA, USA). Sequencing primers were annealed to 
the hairpins of the SMRTbell templates followed by binding with the P5 sequencing polymerase and MagBeads (Pacific 
Biosciences, Menlo Park, CA, USA). One SMRT Cell was run on the PacBio® RS II system with an on-plate concentration of 150 
pM using P5-C3 chemistry and a 180-minute data-collection mode. 

S. cerevisiae collection, DNA Extraction, and SMRTbell Library Preparation 

Please contact J. Li at University of California, San Francisco to obtain the protocol. 

A. thaliana collection, DNA Extraction, and SMRTbell Library Preparation 

Plants were grown from seeds provided by Lehle seeds (WT-04-19-02). Shoots and leaves were harvested at three weeks and 
ground in liquid nitrogen using a mortar and pestle. The complete protocol is described in the document "Preparing 
Arabidopsis Genomic DNA for Size-Selected ~20 kb SMRTbell™ Libraries" [36]. This protocol can be used to prepare purified 
Arabidopsis genomic DNA for size-selected SMRTbell templates with average insert sizes of 10 to 20 kb. We recommend 
starting with 20-40 grams of three-week-old Arabidopsis whole plants, which can generate >100 µg of purified genomic DNA. 
SMRTbell libraries were created using the document "Procedure & Checklist - 20 kb Template Preparation using BluePippin™ 
Size Selection protocol" [35]. Eighty-five SMRT Cells were run on the PacBio RS II system using P4-C2 chemistry and a 180- 
minute data-collection mode. Forty-six SMRT Cells were run on the PacBio RS II system using P5-C3 chemistry and a 180-minute 
data-collection mode. 

N. crassa OR74A collection, DNA Extraction, and SMRTbell Library Preparation 

The T1 strain of N. crassa is an A mating type strain derived by DG Catcheside from a cross between the Em a 5297 and Em A 
5256 strains he obtained from Stirling Emerson in 1955. The fungus was grown in shake culture for 72 hr at 25°C in 500 ml 
Vogel's [37] minimal medium containing 2% sucrose. Mycelium was harvested by filtration, ground in liquid nitrogen, 
resuspended in 10 ml of a buffer containing 0.15 M NaCl, 0.1 M EDTA, 2% SDS at pH 9.5, and incubated overnight at 37°C with 1 
mg protease K. Debris was precipitated by centrifugation and 10 ml distilled water was added to the supernatant, which was 
extracted once with an equal volume of water-saturated phenol and once with chloroform. Nucleic acids were precipitated from 
the aqueous phase with 0.6 volumes of isopropanol. Following centrifugation, the pellet was dried and dissolved in 1 ml TE 
buffer (TRIS 10 mM, 1 mM EDTA pH 8.0). RNA and protein were digested by overnight incubation at 37°C with RNAase (50 µg) 
followed by addition of protease K (50 µg) and further incubation for 2 hr. The digest was extracted once with water-saturated 
phenol and once with chloroform. DNA was collected by precipitation with 0.6 volumes of isopropanol and, following 
centrifugation, the pellet was dried, dissolved in 500 µl TE buffer and stored at 4°C. Field Inversion Gel Electrophoresis (FIGE) 
was run to ensure presence of high-molecular-weight gDNA. The genomic DNA was approximately 25 kb and was not sheared. 
SMRTbell libraries were created using the document "Procedure and Checklist - 10 kb Template Preparation and Sequencing 
(with Low-Input DNA)" [33]. Two SMRT Cells were run on the PacBio RS II system using P4-C2 chemistry and a 180-minute 
data-collection mode. 

N. crassa T1 collection, DNA Extraction, and SMRTbell Library Preparation 

The T1 strain of N. crassa is an A mating type strain derived by DG Catcheside from a cross between the Em a 5297 and Em A 
5256 strains he obtained from Stirling Emerson in 1955. The fungus was grown in shake culture for 72 hr at 25°C in 500 ml 
Vogel's N [37] minimal medium containing 2% sucrose. Mycelium was harvested by filtration, ground in liquid nitrogen, 
resuspended in 10 ml of a buffer containing 0.15 M NaCl, 0.1 M EDTA, 2% SDS at pH 9.5, and incubated overnight at 37°C with 1 
mg protease K. Debris was precipitated by centrifugation and 10 ml distilled water was added to the supernatant, which was 
extracted once with an equal volume of water-saturated phenol and once with chloroform. Nucleic acids were precipitated 
from the aqueous phase with 0.6 volumes of isopropanol. Following centrifugation, the pellet was dried and dissolved in 1 ml 
TE buffer (TRIS 10 mM, 1 mM EDTA pH 8.0). RNA and protein were digested by overnight incubation at 37°C with RNAase (50 
µg) followed by addition of protease K (50 µg) and further incubation for 2 hr. The digest was extracted once with 
water-saturated phenol and once with chloroform. DNA was collected by precipitation with 0.6 volumes of isopropanol and, following 
centrifugation, the pellet was dried, dissolved in 500 µl TE buffer and stored at 4°C. Field Inversion Gel Electrophoresis (FIGE) 
was run to ensure presence of high-molecular-weight gDNA. The genomic DNA was approximately 25 kb and was not sheared. 
SMRTbell libraries were created using the document "Procedure and Checklist - 10 kb Template Preparation and Sequencing 
(with Low-Input DNA)" [33]. Eighteen SMRT Cells were run on the PacBio RS II system using P4-C2 chemistry and a 180-minute 
data-collection mode. 

D. melanogaster collection, DNA Extraction, and SMRTbell Library Preparation 

A total of 1.2 g of adult male ISO1 flies corresponding to 1950 animals were collected, starved for 90-120 min and frozen. The 
flies ranged in age from 0-7 days based on four collections: (1) 0-2 days old, 500 males, 0.33 g; (2) 0-4 days old, 500 males, 0.29 
g; (3) 0-7 days old, 500 males, 0.29 g; (4) 0-2 days old, 450 males, 0.29 g. Flies were ground in liquid nitrogen to a fine powder 
and genomic DNA was purified by phenol-chloroform extraction and CsCl banding in the ultracentrifuge. Briefly, the pulverized 
fly extract was gently re-suspended in 5 ml of HB buffer (7 M Urea, 2% SDS, 50 mM Tris pH 7.5, 10 mM EDTA and 0.35 M NaCl) 
and 5 ml of 1:1 phenol/chloroform. The mixture was shaken slowly for 30 minutes and then centrifuged at 18K rpm for 10 min 
at 20°C. The aqueous phase was re-extracted twice as above and then precipitated by adding two volumes of ethanol and 
centrifuging at 18K rpm for 10 min at 20°C. The pellet was re-suspended in 3 ml of TE (10 mM Tris 1 mM EDTA pH 8.0) by gentle 
inversion. To the re-suspended DNA, 3 g CsCl and 0.3 ml of 10 mg/ml ethidium bromide (EtBr) were added and the mixture 
centrifuged at 45K rpm for 16 hrs at 15°C. The DNA band was collected and the EtBr removed by extraction with CsCl-saturated 
butanol. The DNA was diluted three-fold with TE, 1/10 vol of 5 M NaCl was added and the DNA precipitated with two volumes of 
ethanol. After centrifugation, the pellet was washed in 70% ethanol. The DNA was resuspended in 100 µl TE at a concentration 
of 1.4 µg/µl and quantified using a Nanodrop instrument. This protocol routinely yields at least 10 ng DNA per mg of flies with 
an estimated DNA size >100 kb. 

Genomic DNA was sheared using a g-TUBE device (Covaris) at 4800 rpm and 150 ng/µl, and purified using 0.45x volume ratio of 
AMPure PB beads. SMRTbell libraries were created using the Procedure & Checklist - 20 kb Template Preparation using 
BluePippin™ Size Selection [35]. Libraries were ligated with excess adapters and an overnight incubation was performed to 
increase the yield of ligated fragments larger than 20 kb. Smaller fragments and adapter dimers were then removed by >15 kb 
size selection using the BluePippin DNA size selection system by Sage Science. Forty-two SMRT Cells were run on the PacBio RS 
II system. The first run was composed of four SMRT Cells, loaded at 75 pM, 150 pM, 300 pM, and 400 pM in order to determine 
the optimal loading concentration of the sample. The remaining 38 SMRT Cells were loaded at 400 pM. 



Data Records 

After DNA extraction, libraries were generated and sequenced at Pacific Biosciences of California, 
uploaded to Amazon Web Services' Simple Storage Service (S3), and then submitted to the Sequence 
Read Archive at NCBI under Project ID SRP040522. The corresponding accession numbers and file sizes 
are listed in Table 2. More detailed information including md5 checksums and links to download the 
original data from AWS S3 are provided in Supplementary Table S1. 
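
Because md5 checksums are published alongside the download links, a downloaded file can be checked for corruption before analysis. The following is a minimal sketch using only the Python standard library; the function names are illustrative and are not part of any PacBio or SRA tool, and the expected checksum must be taken from the supplementary table.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Compare a file's MD5 with a published value (hypothetical helper).

    The comparison ignores case and surrounding whitespace in expected_md5.
    """
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for the multi-gigabyte bax.h5 files in these datasets.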

Table 2: Summary of Datasets. Eight datasets from five organisms are described in this paper. Data can 
be accessed from the Sequence Read Archive (SRA) using the accession numbers provided. 



| Organism | Strain | Origin | Polymerase & Chemistry Library kits | SRA Accession | Size (GB) |
|---|---|---|---|---|---|
| E. coli | MG1655 | Lofstrand Labs | P4C2 | SRX669475 | 6.0 |
| E. coli | MG1655 | Lofstrand Labs | P5C3 | SRX533603 | 3.8 |
| S. cerevisiae | 9464 | J. Li | P4C2 | SRX533604 | 38 |
| N. crassa | OR74A | FGSC | P4C2 | SRX533605 | 29 |
| N. crassa | T1 | D. Catcheside | P4C2 | SRX533606 | 143 |
| A. thaliana | Ler-0 | Lehle Seeds | P4C2 | SRX533608 | 263 |
| A. thaliana | Ler-0 | Lehle Seeds | P5C3 | SRX533607 | 252 |
| D. melanogaster | ISO1 | S. Celniker | P5C3 | SRX499318 | 187 |



Raw data was transferred from the instrument to a storage location and organized first by the run name, 
and then by the SMRT Cell directory. Each run contained one or more SMRT Cells. Each SMRT Cell 
produced a metadata.xml file that recorded the run conditions and barcodes of sequencing kits, three 
bax.h5 files that contained base call and quality information of actual sequenced data, and one bas.h5 
file that acted as a pointer to consolidate the three bax.h5 files. The "h5" suffix denotes that these are 
Hierarchical Data Format 5 (HDF5) files. The specific contents and structure of a PacBio bax.h5 file are 
described in more detail in online documentation [38]. 
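
Since every SMRT Cell is expected to yield one metadata.xml file, three bax.h5 files, and one bas.h5 file, a simple completeness check can be run on a downloaded SMRT Cell directory listing. The sketch below assumes only that the file names end with those suffixes; the exact naming scheme of the instrument output is not specified here, and the function name is hypothetical.

```python
from collections import Counter

# Expected number of files per SMRT Cell, keyed by file-name suffix (from the text).
EXPECTED = {".metadata.xml": 1, ".bax.h5": 3, ".bas.h5": 1}


def check_smrt_cell(filenames):
    """Return {suffix: (found, expected)} for each suffix whose count is wrong.

    An empty dict means the listing looks complete.
    """
    found = Counter()
    for name in filenames:
        for suffix in EXPECTED:
            if name.endswith(suffix):
                found[suffix] += 1
    return {s: (found[s], n) for s, n in EXPECTED.items() if found[s] != n}
```

For example, a listing with only two bax.h5 files and no bas.h5 file would be reported as `{".bax.h5": (2, 3), ".bas.h5": (0, 1)}`.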

Recall that the SMRTbell structure that underwent sequencing was created by the library preparation 
process [9]. Sequenced SMRTbells corresponded to raw reads that may pass around the same base 
multiple times. A raw read could therefore have a structure that is composed of left adapter -> DNA 
insert -> right adapter -> reverse complement of DNA insert -> left adapter -> DNA insert -> and so on. 
This raw read is typically processed downstream to remove adapters and create subreads composed of 
the DNA sequence of interest to the investigator. Typical filtering conditions for high-quality SMRT 
sequence data are read score > 0.8, read length > 100, subread length > 500. In addition, the ends of 
reads are trimmed if they are outside of high-quality (HQ) regions, and adapter sequences between 
subreads are removed. 
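
The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the SMRT Portal implementation: the high-quality (HQ) region and adapter hits are assumed to be given as half-open intervals in raw-read coordinates, the thresholds are applied as strict inequalities as written in the text, and all function names are hypothetical.

```python
MIN_READ_SCORE = 0.8      # read score must exceed this value
MIN_READ_LENGTH = 100     # raw read length (nt) must exceed this value
MIN_SUBREAD_LENGTH = 500  # subread length (nt) must exceed this value


def split_subreads(hq_start, hq_end, adapters):
    """Cut the HQ region of a raw read at adapter hits.

    adapters is a list of (start, end) half-open intervals. Returns the
    (start, end) intervals between adapters, clipped to the HQ region.
    """
    subreads = []
    cursor = hq_start
    for a_start, a_end in sorted(adapters):
        if a_end <= hq_start or a_start >= hq_end:
            continue  # adapter lies entirely outside the HQ region
        if a_start > cursor:
            subreads.append((cursor, a_start))
        cursor = max(cursor, a_end)
    if cursor < hq_end:
        subreads.append((cursor, hq_end))
    return subreads


def filter_read(read_score, read_length, hq_start, hq_end, adapters):
    """Return the subread intervals of one raw read that pass all filters."""
    if read_score <= MIN_READ_SCORE or read_length <= MIN_READ_LENGTH:
        return []
    return [(s, e) for s, e in split_subreads(hq_start, hq_end, adapters)
            if e - s > MIN_SUBREAD_LENGTH]
```

With a 10,000 nt raw read, an HQ region of (50, 9900), and adapters at (3000, 3045), (6000, 6045), and (9700, 9745), three subreads pass; the 155 nt fragment after the last adapter is discarded by the subread length filter.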

The post-filter statistics of each dataset are listed in Table 3. While raw read lengths reflect the true 
sequencing capacity of the instrument, only subreads are summarized in Table 3 because they are used in 
downstream analysis algorithms such as de novo assemblers. Multiple subreads can be contained within 
one raw read, and subreads exclude adapters and low-quality sequence. N50 is a statistic used to 
describe the length distribution of a collection of reads, contigs, or scaffolds, and is defined as the length 
where 50% of all bases are contained in sequences longer than that length. The N50 filtered subread 
lengths ranged from 7.6 kb to 10.5 kb for datasets generated with P4-C2 chemistry and ranged from 
12.2 kb to 14.2 kb for datasets generated with P5-C3 chemistry. With the exception of N. crassa OR74A, 
all datasets were sequenced to high coverage (>68X) and are sufficient for de novo genome assembly 
applications. 
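
For readers who wish to recompute the N50 statistic from a list of subread lengths, a minimal sketch follows. It uses the common convention of sorting lengths in descending order and returning the length at which the running total first reaches half of all bases; the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are visited from longest to shortest; the N50 is the length at
    which the cumulative sum first reaches at least half of the total bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")
```

For lengths 2, 3, 4, 5, and 6 (20 bases in total), the two longest sequences already contain 11 bases, so the N50 is 5.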
Table 3: Summary statistics of filtered data. Results shown for each dataset are based on output of 
SMRT Portal analysis using the default filtering parameters (see text for details). Fold coverage is 
calculated relative to the estimated genome size. 



| Dataset Name | Number of filtered subreads | N50 filtered subread length (nt) | Maximum filtered subread length (nt) | Total filtered subread (nt) | Estimated genome size (Mb) | Fold coverage |
|---|---|---|---|---|---|---|
| E. coli MG1655 P4C2 | 61,019 | 7,586 | 22,609 | 331,516,965 | 5 | 66X |
| E. coli MG1655 P5C3 | 43,063 | 12,041 | 28,647 | 373,874,428 | 5 | 75X |
| S. cerevisiae 9464 P4C2 | 269,145 | 8,821 | 30,164 | 1,597,871,118 | 12 | 133X |
| N. crassa OR74A P4C2 | 175,926 | 7,617 | 30,845 | 981,884,113 | 40 | 25X |
| N. crassa T1 P4C2 | 210,480 | 10,462 | 36,227 | 11,497,185,440 | 40 | 287X |
| A. thaliana Ler-0 P4C2 | 1,338,320 | 8,769 | 41,753 | 8,129,670,483 | 120 | 68X |
| A. thaliana Ler-0 P5C3 | 2,067,212 | 12,188 | 47,445 | 17,714,447,516 | 120 | 148X |
| D. melanogaster ISO1 P5C3 | 1,561,929 | 14,214 | 44,766 | 15,194,174,294 | 160 | 95X |
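
The fold coverage values in Table 3 follow from dividing the total filtered subread bases by the estimated genome size. A short sketch that reproduces them is given below; the dictionary simply restates the table values, and rounding to the nearest integer is an assumption about how the table entries were derived.

```python
# (total filtered subread bases, estimated genome size in Mb), from Table 3.
TABLE3 = {
    "E. coli MG1655 P4C2": (331_516_965, 5),
    "E. coli MG1655 P5C3": (373_874_428, 5),
    "S. cerevisiae 9464 P4C2": (1_597_871_118, 12),
    "N. crassa OR74A P4C2": (981_884_113, 40),
    "N. crassa T1 P4C2": (11_497_185_440, 40),
    "A. thaliana Ler-0 P4C2": (8_129_670_483, 120),
    "A. thaliana Ler-0 P5C3": (17_714_447_516, 120),
    "D. melanogaster ISO1 P5C3": (15_194_174_294, 160),
}


def fold_coverage(total_bases, genome_size_mb):
    """Return fold coverage: sequenced bases divided by genome size in bases."""
    return total_bases / (genome_size_mb * 1_000_000)


if __name__ == "__main__":
    for name, (bases, size_mb) in TABLE3.items():
        print(f"{name}: {round(fold_coverage(bases, size_mb))}X")
```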



Technical Validation 

DNA and Sample preparation 

To assess the quality of genomic DNA received, we used Qubit (Life Technologies) and Nanodrop (Thermo 
Scientific) to measure the concentration of genomic DNA. Ideal samples had similar concentration 
estimates on both platforms, with A230/260/280 ratios close to 1:1.8:1, corresponding to what is expected of 
pure DNA. All samples presented here passed this screening criterion. 

Next we assessed the size of the genomic DNA received. For genomic DNA where the size range was 
less than 17 kb, we used the 2100 Bioanalyzer (Agilent) to determine the actual size distribution. For 
genomic DNA where the size range was greater than 17 kb, we opted for pulsed-field gel electrophoresis to 
better estimate the larger size distributions. The sizes of the genomic DNA for each sample are listed in 
Table 1. 

To ensure that the library insert sizes were in the optimal size range, we sheared genomic DNA using 
g-TUBEs if the apparent size was greater than 40 kb. Alternatively, if the size was less than 40 kb, then the 
DNA was not sheared and was carried straight through to library preparation. Extremely small fragments 
(<100 bp) and adapter dimers were eliminated by AMPure beads. Adapter dimers (0-10 bp) and small 
inserts (11-100 bp) represented less than 0.01% of all the reads sequenced in all datasets. We 
additionally used the Blue Pippin (Sage Science) to ensure that the libraries had a physical size of 
10 kb or greater. The size cutoffs used for each sample are listed in Table 1. 
Analysis and Quality Filtering 

To assess the quality of the libraries sequenced, we examined the percent of bases filtered by a standard 
QC procedure. Filtering conditions for high-quality SMRT sequence data are read score > 0.8, read 
length > 100, subread length > 500. In addition, the ends of reads are trimmed if they are outside of 
high-quality (HQ) regions, and adapter sequences between subreads are removed. All samples retained 
71-97% of the bases after filtering. 

To ensure that the sequences matched the model organism of interest, we examined the percent of 
post-filter bases that were mapped to the closest reference genome available. All samples had a 
mapping rate of 81-95%, with the exception of the Neurospora T1 sample that had a mapping rate of 
62%. This sample may have some damaged DNA as it had been stored in a freezer for over 20 years. 
Nonetheless, preliminary unpublished results show that the sequence from the Neurospora T1 sample 
can be successfully assembled into a genome that is more contiguous than the existing reference 
genome for Neurospora [39]. 



Usage Notes 

The datasets described in this paper were first released on DevNet [40], the PacBio Software Developer 
Community Network website, with brief descriptions on the PacBio blog. DevNet typically hosts open- 
source software; SampleNet [30], the PacBio Sample Preparation Community Network website, typically 
hosts protocols for DNA extraction and library preparation. These websites provide valuable data and 
documentation about the technology, but are not considered a part of the traditional academic record. 
This paper in Nature Scientific Data provides an opportunity to describe the methodology and 
characteristics of the eight datasets in more detail, creates a citable entity for the scientific community, 
and allows the data to be continually hosted and maintained by the Sequence Read Archive. 

DNA sequencing instruments and chemistries change rapidly, and PacBio SMRT sequencing is no 
exception. The datasets presented here are from P4-C2 and P5-C3 polymerase-chemistry combinations, 
spanning release dates from late 2013 to early 2014. These datasets represent some of the longest 
read lengths to date for these chemistries, and can be used to benchmark and develop new algorithms 
and to advance the state of the art as the technology evolves. 
cerevisiae 9464 sample. JML deposited data to the SRA. CSC, AP, CMB and JML analyzed the data and 
prepared the manuscript. CMB and JML coordinated the project. 